populace-build: reference-pinned, recorded, reconstructable parity (part of #19) #21
Closed
MaxGhenis wants to merge 1 commit into
…art of #19)

Issue #19 found the certified US parity verdict was unreproducible on three counts: the parity reference eCPS was unpinned (gaps=0 was judged against a working-copy eCPS that since drifted), the release manifest recorded no gate result (no gaps count, reference identity, or skipped-layer count), and the runner that produced the verdict was deleted from HEAD in fda3838 (only the gate *library* survived). So "parity 0" was neither reproducible nor drift-detectable.

Add packages/populace-build/src/populace/build/parity_reference.py, which re-homes the simulation-level parity runner into the package and pins + records what it judges against:

- ReferenceSpec: a frozen dataclass identifying the reference eCPS by sha256 (mandatory) plus either a Hugging Face repo+revision or a local path. An unpinned reference is refused at construction; it is the exact #19 bug. from_local_file() hashes via the shared trace.sha256_file (one hash definition across the build).
- judge_parity(): pure. Reuses the surviving populace.build.parity_gate for the verdict (failure lines verbatim, not reinvented) and records the reference identity + gap/skip/populated-layer counts in GateResult.details. Runs with no dataset and no policyengine_us: it takes precomputed share dicts. This is the "judge + record reference identity" half, separated from "gather shares".
- gather_candidate_shares(): ports check_parity.py's simulation loop (skip vars not engine-registered or non-annual; non-zero share per var; pop structural weights; a failed sim.calculate is a recorded gap, not a skip), with the Microsimulation isolated behind an injected calculate callable so the skip rules are unit-testable without policyengine_us or a 355MB dataset.
- reference_layers(): reads the eCPS's flat var/YEAR HDF5 layers (port of stored_layers), skipping string/object columns and (fixing a latent defect in the original) filtering by year correctly, so an off-year dataset (e.g. takes_up_aca_if_eligible/2025) is neither counted under the wrong year nor recorded with a year token as its variable name. Verified against the real frozen reference: 236 true 2024 numeric layers (the original's 237 included one such phantom "2025" key).
- run_parity_against_reference(): the thin orchestrator that wires the real sim in (the only path that imports policyengine_us), hashes/pins the reference, and records the verdict.

The module docstring states the release manifest should carry this GateResult (via GateReport.to_manifest(), already content-hashed into the build TRO by trace.py) and that a CI drift-check re-running parity against the latest eCPS is the follow-up, not built here.

TDD: 25 tests covering the sha256 requirement, gap/pass/known-gap exemption, recorded reference identity (sha256/repo/revision) and counts, the gather skip logic + structural-weight popping over a fake sim, the H5 var/YEAR reader (including year filtering and the entity/variable layout), and a manifest round-trip. No network, no large files. Re-exported from populace.build alongside the gates.

Scope: does NOT close the PUF-derived credit-input data gaps or build the CI drift-check job; both remain tracked in #19.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
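The "refuse an unpinned reference at construction" rule described for ReferenceSpec can be illustrated with a minimal sketch. This is not the module's actual code: the validation messages, the exact field defaults, and the local `sha256_file` helper (a stand-in for the shared trace.sha256_file, whose implementation is not shown in this PR text) are assumptions made for illustration.

```python
from __future__ import annotations

import hashlib
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


def sha256_file(path: Path) -> str:
    """Hypothetical stand-in for trace.sha256_file: stream the file through SHA-256."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


@dataclass(frozen=True)
class ReferenceSpec:
    """Sketch of a pinned reference identity (field names follow the PR text)."""

    sha256: str
    repo: Optional[str] = None
    revision: Optional[str] = None
    path: Optional[str] = None

    def __post_init__(self) -> None:
        # An empty or None hash means the reference is unpinned: refuse it.
        if not self.sha256:
            raise ValueError("reference must be pinned by sha256")
        # Assumed rule: repo and revision only make sense together.
        if (self.repo is None) != (self.revision is None):
            raise ValueError("a Hugging Face reference needs both repo and revision")
        # Assumed rule: the spec must say where the bytes came from.
        if self.repo is None and self.path is None:
            raise ValueError("need either repo+revision or a local path")

    @classmethod
    def from_local_file(cls, path: Path) -> "ReferenceSpec":
        return cls(sha256=sha256_file(path), path=str(path))


# Usage: pin a small temporary file, then show an unpinned spec being refused.
with tempfile.TemporaryDirectory() as tmp:
    frozen = Path(tmp) / "reference.h5"
    frozen.write_bytes(b"frozen reference bytes")
    pinned = ReferenceSpec.from_local_file(frozen)

try:
    ReferenceSpec(sha256="", repo="example-org/example-repo", revision="main")
    unpinned_refused = False
except ValueError:
    unpinned_refused = True
```

Validating in `__post_init__` means no code path can hold a spec without a hash, which is the property the gate needs if the recorded verdict is to stay tied to specific reference bytes.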
59c90a2 to 12ac387
This was referenced Jun 14, 2026
Closing out of live Populace. Incumbent/eCPS comparison and reference-drift harnesses now belong in PolicyEngine/populace-benchmarks; the related issues were transferred there as PolicyEngine/populace-benchmarks#1 and #2.
Fixes part of PolicyEngine/populace-benchmarks#1 — the pin + record + reconstructable runner half. The data-gap closure (missing PUF-derived credit inputs) and the CI drift-check job remain open in PolicyEngine/populace-benchmarks#1.
What PolicyEngine/populace-benchmarks#1 found
The certified US default passed its build-time parity gate, but the verdict was not reproducible, on three process counts:
- The parity reference eCPS was unpinned, so `gaps=0` silently rotted the moment the eCPS changed (and PolicyEngine/populace-benchmarks#1, "Certified populace-us drifts behind current eCPS on 10 parity layers; pin the parity reference + restore the gate runner", reproduced exactly that: the certified default fell behind a newer eCPS on 10 layers).
- The release manifest recorded no gate result: no gaps count, no reference identity, and no skipped-layer count.
- The runner that produced the verdict was deleted from HEAD in `fda3838` (`packages/populace-data/build/us/check_parity.py`). The gate library (`populace.build.parity_gate`) survived, but nothing invoked it against a reference, so the contract was only reconstructable from git history.

What this PR adds
A new module, `packages/populace-build/src/populace/build/parity_reference.py`, that re-homes the simulation-level parity runner into the package and makes it reference-pinned and recorded. It reuses the surviving `parity_gate` for the judging logic (does not reimplement it):
- `ReferenceSpec`: frozen dataclass identifying the reference eCPS by sha256 (mandatory) plus either a Hugging Face `repo` + `revision` or a local `path`. An unpinned reference is refused at construction; that is the exact bug reported in PolicyEngine/populace-benchmarks#1. `from_local_file()` hashes via the shared `trace.sha256_file`, so the reference is hashed by the same algorithm the build's TRACE provenance uses for every other artifact.
- `judge_parity(candidate_shares, reference_shares, reference_spec, *, known_gaps=(), skipped=())`: pure, fully unit-tested. Delegates the verdict to `parity_gate` (failure lines verbatim, not reinvented) and records in `GateResult.details`: the reference identity (`reference`: sha256 + repo/revision or path + kind), `skipped` / `skipped_layers`, and `candidate_populated_layers` alongside the base gate's `reference_populated_layers` / `gaps` / `exempted`. Runs with no 355 MB dataset and no `policyengine_us`: it takes precomputed share dicts. This is the core separation: judge + record reference identity (pure, tested) vs gather shares (needs the sim).
- `gather_candidate_shares(reference_layers, *, year, tax_benefit_system, calculate)`: ports `check_parity.py`'s simulation loop: skip variables the engine does not register or that are non-annual; non-zero share per variable; pop structural weights (`household_weight`, `person_weight`); a failed `calculate` is recorded as a `0.0` candidate share (a real gap), not a skip, matching the deleted runner. The `Microsimulation` is isolated behind an injected `calculate` callable so the skip rules and weight-popping are unit-testable without `policyengine_us` or a dataset.
- `reference_layers(path, *, year)`: reads the eCPS's flat `var/YEAR` HDF5 layers (port of `stored_layers`), skipping string/object columns and filtering by year correctly. This fixes a latent defect in the original: an off-year dataset (`takes_up_aca_if_eligible/2025`) was mis-parsed into a phantom layer named `"2025"`. Verified against the real frozen reference (`enhanced_cps_2024_hf_main.h5`): reads 236 true 2024 numeric layers (the original's 237 included that one phantom key).
- `run_parity_against_reference(candidate_path, reference, *, year, known_gaps=())`: thin orchestrator wiring gather → judge: resolves/hashes the reference to a pinned `ReferenceSpec`, reads its layers, builds the candidate sim (the only path that imports `policyengine_us`), injects `sim.calculate(var, year).values` as the seam, and returns the recorded `GateResult`.

Tested vs deferred
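As an illustration of the year-filtering behaviour that the tests listed here cover for `reference_layers`, the sketch below works on plain `var/YEAR` key strings rather than a real HDF5 file. The function name `layers_for_year` and the sample keys other than `takes_up_aca_if_eligible/2025` are hypothetical; the real reader also skips string/object columns and handles the `entity/variable` layout, which this sketch does not attempt.

```python
from typing import Iterable, List


def layers_for_year(keys: Iterable[str], year: int) -> List[str]:
    """Return variable names whose key is exactly `<variable>/<year>`.

    Off-year keys are dropped entirely, so a year token can never leak
    through as a variable name.
    """
    wanted = str(year)
    names = []
    for key in keys:
        variable, separator, period = key.rpartition("/")
        if not separator:
            continue  # not a var/YEAR key at all
        if period != wanted:
            continue  # off-year layer: neither counted nor renamed
        names.append(variable)
    return sorted(names)


# Usage: the 2025-only layer must not appear under 2024 in any form.
sample_keys = [
    "employment_income/2024",
    "household_weight/2024",
    "takes_up_aca_if_eligible/2025",
]
layers_2024 = layers_for_year(sample_keys, 2024)
```

Splitting on the last `/` and comparing the period token explicitly is what keeps the variable name and the year from being confused with each other.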
Tested (TDD, 25 tests, no network, no large files):
- `ReferenceSpec` rejects construction without a sha256 (and a `None` sha256, and a HF spec missing `repo` / `revision`); `from_local_file` hashes the bytes.
- `judge_parity`: gap when the reference populates a layer for which the candidate shares show 0; pass when the candidate populates it; `known_gaps` exemption honored; reference identity (sha256/repo/revision/path) recorded in `details`; layer/skipped counts recorded; failure text identical to `parity_gate`'s.
- `gather_candidate_shares`: with a fake `tax_benefit_system` stub (annual + monthly + unregistered vars) and an injected fake sim, the skip logic, structural-weight popping, and the failed-calculate-is-a-gap rule.
- `reference_layers`: flat `var/YEAR` reader, string-column skipping, year filtering (no off-year/phantom-key leak), and the `entity/variable` single-year layout (real `h5py`, tiny in-memory files).
- Manifest round-trip: `GateResult` serializes via `GateReport.to_manifest()` into a manifest-style dict carrying the reference identity, and is JSON-serializable.
- Re-exported from `populace.build` alongside the gates.

Deferred to follow-up (still tracked in PolicyEngine/populace-benchmarks#1, noted in the module docstring):
- The CI drift-check job: re-run `run_parity_against_reference` against the latest published eCPS on every build so reference drift fails loudly. Not built here.
- Closing the data gaps (missing PUF-derived credit inputs, … `self_employed_health_insurance_ald` upstream, the 3 reported aggregates): separate impute-stage work.

Documentation note
The repo has no `docs/` tree or changelog tooling, so the release-manifest contract is documented in the module docstring: the produced `GateResult` should travel with the release via `GateReport.to_manifest()` (already content-hashed into the build's TRACE TRO by `trace.py`), and the CI drift-check is the companion follow-up.

Deviation from spec
`gather_candidate_shares` takes an injected `calculate` callable (and the `tax_benefit_system`) rather than a `candidate_path`. Building the `Microsimulation` from a path inside `gather` would make it untestable without `policyengine_us`, a dependency the spec explicitly forbids for the test suite. The path → sim step lives in the `run_parity_against_reference` orchestrator instead, which is a faithful reading of "the sim dependency is isolated here" + "inject the sim via a callable." Same intent; the seam just sits one call out.

🤖 Generated with Claude Code
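The injected-callable seam can be sketched as follows. This is a simplified illustration under stated assumptions, not the PR's implementation: it assumes the tax-benefit system exposes a `variables` mapping whose entries carry a `definition_period` string, computes an unweighted non-zero share, and returns a `(shares, skipped)` tuple. The fake objects and variable names in the usage part are hypothetical.

```python
from types import SimpleNamespace
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

STRUCTURAL_WEIGHTS = ("household_weight", "person_weight")


def gather_candidate_shares(
    reference_layers: Iterable[str],
    *,
    year: int,
    tax_benefit_system,
    calculate: Callable[[str, int], Sequence[float]],
) -> Tuple[Dict[str, float], List[str]]:
    shares: Dict[str, float] = {}
    skipped: List[str] = []
    for name in reference_layers:
        if name in STRUCTURAL_WEIGHTS:
            continue  # structural weights are popped, not judged
        variable = tax_benefit_system.variables.get(name)
        if variable is None or variable.definition_period != "year":
            skipped.append(name)  # unregistered or non-annual: a skip
            continue
        try:
            values = list(calculate(name, year))
        except Exception:
            shares[name] = 0.0  # a failed calculate is a gap, not a skip
            continue
        nonzero = sum(1 for value in values if value != 0)
        shares[name] = nonzero / len(values) if values else 0.0
    return shares, skipped


# Usage with a fake system and a fake calculate: no simulation engine needed.
fake_system = SimpleNamespace(
    variables={
        "income_tax": SimpleNamespace(definition_period="year"),
        "snap": SimpleNamespace(definition_period="month"),
        "broken": SimpleNamespace(definition_period="year"),
        "household_weight": SimpleNamespace(definition_period="year"),
    }
)


def fake_calculate(name: str, year: int) -> Sequence[float]:
    if name == "broken":
        raise RuntimeError("simulated engine failure")
    return {"income_tax": [0.0, 10.0, 0.0, 5.0]}[name]


shares, skipped = gather_candidate_shares(
    ["income_tax", "snap", "ghost", "household_weight", "broken"],
    year=2024,
    tax_benefit_system=fake_system,
    calculate=fake_calculate,
)
```

Because the function only ever sees a callable, the orchestrator can pass something like `lambda var, year: sim.calculate(var, year).values` in production while tests pass a plain function, which is the trade-off the deviation above describes.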